Day 10: Model Training Step 5 | Feature Engineering

16th Ironman Contest

John Wu

2024-09-01 09:41:10


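Now that the dataset has been split, the next step is to make the training data easier for the model to absorb and learn from. That means reshaping the data into a form a machine learning model can work with more readily, and that is the job of "feature engineering".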

What is feature engineering?

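The definition ChatGPT found for me online is: "Feature engineering is the process of using a range of techniques to transform raw data into features that better describe the model's output or target variable; these features are then used to train a machine learning model. Good features help the model learn the patterns and regularities in the data, which improves the accuracy and stability of its predictions."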

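From that explanation, the core of feature engineering is to "bring out the patterns and regularities in the data so the model can learn them, improving its accuracy and stability". It is not strictly required, but doing it can further improve the model's accuracy!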
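To make the idea concrete, here is a minimal sketch of two common transformations on a toy table: deriving a binary flag from a numeric column, and log-transforming a skewed price. The DataFrame `toy` and its English column names are hypothetical and only serve the illustration; they are not the dataset used in this series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative only.
toy = pd.DataFrame({
    "area": [20.0, 35.0, 50.0],
    "age": [2, 10, 30],
    "price": [800.0, 1500.0, 1200.0],
})

# Binary flag derived from a numeric column (new if at most 5 years old).
toy["is_new"] = (toy["age"] <= 5).astype(int)

# log1p compresses a right-skewed value; expm1 is its inverse.
toy["log_price"] = np.log1p(toy["price"])
recovered = np.expm1(toy["log_price"])

print(toy[["age", "is_new", "log_price"]])
```

Because `expm1` undoes `log1p`, a model trained on the log-transformed price can have its predictions converted back to the original unit afterwards.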

For our house price prediction example, we can use feature engineering to adjust the data we have so far:

import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Helper that renames duplicate column names so every column name is unique
def make_unique_columns(columns):
    seen = {}
    new_columns = []
    for item in columns:
        if item in seen:
            seen[item] += 1
            new_item = f"{item}_{seen[item]}"
            while new_item in seen:
                seen[item] += 1
                new_item = f"{item}_{seen[item]}"
            new_columns.append(new_item)
        else:
            seen[item] = 0
            new_columns.append(item)
    return new_columns

# Apply the renaming (df_cleaned is assumed to be the cleaned DataFrame from the earlier steps)
df_cleaned.columns = make_unique_columns(df_cleaned.columns)

# Update the feature lists
numeric_features = [col for col in df_cleaned.columns if re.match(r'房屋面積\(坪\).*|房齡\(年\).*', col)]
categorical_features = ['位置']
custom_features = ['單位面積價格', '是否新房', '位置平均房價', '位置房價標準差']

# 1. Numeric feature processing: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# 2. Categorical feature processing: fill missing values with 'missing', then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# 3. Combine the feature transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('custom', 'passthrough', custom_features)
    ])

# 4. Create new features
base_area_col = [col for col in numeric_features if '房屋面積(坪)' in col][0]
base_age_col = [col for col in numeric_features if '房齡(年)' in col][0]

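# Note: 單位面積價格 is computed from the target 房價(萬), so keeping it in X leaks the target into the features
df_cleaned['單位面積價格'] = df_cleaned['房價(萬)'] / df_cleaned[base_area_col]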
df_cleaned['是否新房'] = (df_cleaned[base_age_col] <= 5).astype(int)

# 5. Log transform
df_cleaned['log_房價'] = np.log1p(df_cleaned['房價(萬)'])

# 6. Polynomial features (using house area as an example)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df_cleaned[[base_area_col]])
poly_feature_names = poly.get_feature_names_out([base_area_col])
for name, feature in zip(poly_feature_names, poly_features.T):
    if name not in df_cleaned.columns:
        df_cleaned[name] = feature
# Note: these new columns are not listed in the ColumnTransformer above, so its default remainder='drop' leaves them out of X_processed

# 7. Location-based statistical features
df_cleaned['位置平均房價'] = df_cleaned.groupby('位置')['房價(萬)'].transform('mean')
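# std is NaN for a location with only one listing, so fill it with 0 before it is passed through
df_cleaned['位置房價標準差'] = df_cleaned.groupby('位置')['房價(萬)'].transform('std').fillna(0)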

# 8. Apply the preprocessor
X = df_cleaned.drop(['房價(萬)', 'log_房價'], axis=1)  # Note: remove the target variable
y = df_cleaned['log_房價']  # Use the log-transformed price as the target

X_processed = preprocessor.fit_transform(X)

# Collect all feature names, in the same order as the ColumnTransformer output
numeric_feature_names = numeric_features
categorical_feature_names = preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features).tolist()
custom_feature_names = custom_features

all_feature_names = numeric_feature_names + categorical_feature_names + custom_feature_names

print("Processed features:", all_feature_names)
print("Processed data shape:", X_processed.shape)

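Beyond the original columns, feature engineering is also where you can create new features. Anything that helps the model learn the decision logic you want has a positive effect on it, so this is a good stage to produce that as a new feature.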
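One caveat when a new feature is built from the target itself, as with the per-location average price in the code above: if the statistic is computed over every row, information from the validation or test rows reaches the model. A common way to avoid this is to learn the statistic from the training split only and then map it onto the other splits. The sketch below uses hypothetical `train` and `test` tables with English column names, and it is only one possible approach.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative only.
train = pd.DataFrame({
    "location": ["A", "A", "B"],
    "price": [100.0, 200.0, 400.0],
})
test = pd.DataFrame({"location": ["A", "B", "C"]})

# Learn the per-location mean from the training rows only.
location_mean = train.groupby("location")["price"].mean()
global_mean = train["price"].mean()

# Map the learned means onto both splits; a location that never appeared
# in training falls back to the overall training mean.
train["location_mean_price"] = train["location"].map(location_mean)
test["location_mean_price"] = test["location"].map(location_mean).fillna(global_mean)

print(test)
```

Here location "C" never appears in the training rows, so it receives the overall training mean rather than a missing value.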

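Now the data we need is ready and shaped into a format suitable for training, so the next step is to choose a suitable model! There are countless models to choose from, such as decision trees, random forests, and neural networks, and the next article will show how to decide which one fits.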